1 Introduction

This paper describes our project on housing prices in Melbourne, Australia, using sales data from 2017. A reliable price prediction system supports the healthy development of the real estate industry: a simple predictive and inferential model of housing prices helps buyers, sellers, and agents determine fair prices and helps governments assess property taxes. This project aims to learn how different factors affect home sale prices by building linear models. Although the data were collected in Melbourne in 2017, the idea that location and home attributes correlate with housing prices could reasonably apply more broadly, including in other countries.

This project focuses on the following objectives:

  1. Determine whether housing prices in Melbourne, Australia can be predicted using this dataset.
  2. Determine what variables have the greatest impact on housing price.
  3. Analyze the impacts of location, seller, and construction attributes of homes on the housing market in Melbourne, Australia.

1.1 The Melbourne Housing Snapshot Dataset

  • Home Sales in 2017
    • Location
    • Construction
    • Sale
  • Variables: 21
    • Numeric: 12
    • Categorical: 9

1.2 The Variables

Rooms: Number of rooms

Price: Sale price (AUD)

Method: Method of sale - 5 categories

Type: House, Unit, Townhouse - 3 categories

SellerG: Real Estate Agent - 268 categories

Date: Date sold

Distance: Distance from Central Business District

Regionname: Region name - 8 categories

Propertycount: Number of properties that exist in the suburb

Bedroom2: Number of bedrooms

Bathroom: Number of bathrooms

Car: Number of car spaces

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year home built

CouncilArea: Governing council for the area - 34 categories

Lattitude, Longtitude: GPS location

Suburb: Suburb name - 314 categories

1.3 Goals

  1. Understand which attributes of a home and its sale determine the final sale price
  2. Attempt to build a reasonable model for inference and/or prediction of the final sale price

2 Exploratory Data Analysis (EDA)

The following are excerpts and graphs from our exploratory data analysis. This part of the project familiarizes the reader with the dataset's attributes and lays the foundation for choosing the variables we will include in our linear model. The results of the EDA also inform the future direction of the project.

2.1 Select Data Pairs

2.2 Correlations

2.3 Map of Melbourne Sales

2.4 Selling Price

2.4.1 Summary of Price Statistics

Summary of data.full$Price (AUD):

Statistic   Value
Min         85,000
Q1          650,000
Median      903,000
Mean        1,075,684
Q3          1,330,000
Max         9,000,000
SD          639,310.72
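
A summary of this kind takes only a few lines to compute. The following is a minimal Python sketch, not the analysis code used in this project: the `prices` list is a small hypothetical sample rather than the dataset, and `summarize` is an illustrative helper name. The `inclusive` quantile method is used on the assumption that the quartiles above were produced with R's default interpolation.

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean, max, and sample SD of a list of numbers."""
    # method="inclusive" interpolates between order statistics,
    # treating the sample minimum and maximum as the 0% and 100% points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
    }


# Hypothetical sale prices in AUD (not taken from the dataset)
prices = [400_000, 650_000, 903_000, 1_330_000, 2_100_000]
summary = summarize(prices)
for name, value in summary.items():
    print(f"{name:>6}: {value:,.0f}")
```

A mean well above the median, as in the table above, indicates a right-skewed price distribution, which is one motivation for examining log prices.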

2.4.2 Selling Price Distribution

2.4.3 Log Selling Price Distribution

2.4.4 Price by Region

2.4.5 Price by Number of Rooms (<10 Rooms)

2.4.6 Price by Type of Home

2.5 Test of Independence (Pearson \(\chi^2\)) by Group: Type, Rooms, Regionname, SellerG

\(H_0\): All means equal by group

All four tests reject \(H_0\) with p-value \(< 2\times 10^{-16}\).
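
The stated null hypothesis, equal means across groups, is the null of a one-way ANOVA. The sketch below illustrates that F statistic under this assumption; it does not reproduce the \(\chi^2\) test named in the heading, and the source does not show the code that was actually run. The function name `one_way_anova_f` and the toy data are hypothetical.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical prices (in units of AUD 100,000) for three home types
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f} on {len(toy_groups) - 1} and "
      f"{sum(len(g) for g in toy_groups) - len(toy_groups)} degrees of freedom")
```

A large F statistic relative to its reference distribution yields a small p-value, which is the pattern reported for all four grouping variables.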

2.6 Price by Region and Type

3 Linear Modelling

3.1 First Attempt at Linear Model

3.2 Linear Model 2: Removed the Variable with Highest VIF
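
For context, the variance inflation factor of predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the remaining predictors. The sketch below covers only the special case of two predictors, where \(R_j^2\) equals the squared Pearson correlation. The data are hypothetical, the helper names are illustrative, and the source does not state which variable was removed.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x, y):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical room and bedroom counts for five homes (not from the dataset)
rooms = [2, 3, 3, 4, 5]
bedrooms = [2, 3, 2, 4, 4]
vif = vif_two_predictors(rooms, bedrooms)
print(f"VIF = {vif:.2f}")
```

With more than two predictors, each \(R_j^2\) requires its own auxiliary regression, so a full implementation would loop over predictors rather than use a single correlation.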

3.2.1 Model Coefficients

3.3 Linear Model 3: Considered Interactions

3.4 Linear Model 4: Removed Land Size

3.4.1 Model 4 Coefficients

3.4.2 Residual Analysis

3.4.2.1 Normality

3.4.2.2 Influence

3.4.3 Remove Influence Points

3.5 Transformation of Selling Price

3.5.1 Transformed Data - Homogeneity

3.5.2 Transformed Data - Normality

4 Proposed Model

Observations         4955 (4548 missing obs. deleted)
Dependent variable   Price
Type                 OLS linear regression

F(15,4939)   425.77
R²           0.56
Adj. R²      0.56

Term                                            Est.             S.E.           t val.   p
(Intercept)                                     -129445309.64    17287919.00    -7.49    0.00
Rooms                                           255713.04        9273.23        27.58    0.00
Distance                                        -44501.15        1467.41        -30.33   0.00
Bathroom                                        115532.73        11453.40       10.09    0.00
Car                                             45801.33         7532.08        6.08     0.00
BuildingArea                                    1794.92          90.67          19.80    0.00
Lattitude                                       -757249.47       124821.36      -6.07    0.00
Longtitude                                      696894.60        116542.04      5.98     0.00
Propertycount                                   -3.69            1.52           -2.42    0.02
factor(Regionname)Eastern Victoria              188304.28        103229.34      1.82     0.07
factor(Regionname)Northern Metropolitan         -55889.99        30219.79       -1.85    0.06
factor(Regionname)Northern Victoria             598550.10        116506.82      5.14     0.00
factor(Regionname)South-Eastern Metropolitan    169831.25        51506.18       3.30     0.00
factor(Regionname)Southern Metropolitan         212777.72        27309.69       7.79     0.00
factor(Regionname)Western Metropolitan          -86943.29        38731.01       -2.24    0.02
factor(Regionname)Western Victoria              515064.40        135131.01      3.81     0.00

Standard errors: OLS
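
To show how the coefficients combine into a prediction, the following Python sketch evaluates the fitted linear predictor using the estimates in the table. The home described in `example_home` is hypothetical, as are the names `predict_price`, `SLOPES`, and `REGION_OFFSETS`. The region with no row in the table is treated as the reference level and contributes no offset.

```python
# Estimates copied from the coefficient table (units: AUD)
INTERCEPT = -129445309.64
SLOPES = {
    "Rooms": 255713.04,
    "Distance": -44501.15,
    "Bathroom": 115532.73,
    "Car": 45801.33,
    "BuildingArea": 1794.92,
    "Lattitude": -757249.47,
    "Longtitude": 696894.60,
    "Propertycount": -3.69,
}
REGION_OFFSETS = {
    "Eastern Victoria": 188304.28,
    "Northern Metropolitan": -55889.99,
    "Northern Victoria": 598550.10,
    "South-Eastern Metropolitan": 169831.25,
    "Southern Metropolitan": 212777.72,
    "Western Metropolitan": -86943.29,
    "Western Victoria": 515064.40,
}


def predict_price(home, region=None):
    """Evaluate the fitted linear predictor.

    region=None denotes the reference region, i.e. the one level of
    Regionname that has no row in the coefficient table.
    """
    price = INTERCEPT + sum(SLOPES[name] * home[name] for name in SLOPES)
    if region is not None:
        price += REGION_OFFSETS[region]
    return price


# Hypothetical home (values chosen for illustration, not taken from the dataset)
example_home = {
    "Rooms": 3, "Distance": 10.0, "Bathroom": 1, "Car": 1,
    "BuildingArea": 150.0, "Lattitude": -37.8, "Longtitude": 145.0,
    "Propertycount": 7000,
}
base_price = predict_price(example_home)
print(f"Reference region:      {base_price:,.0f}")
print(f"Southern Metropolitan: "
      f"{predict_price(example_home, 'Southern Metropolitan'):,.0f}")
```

The very large negative intercept is offset by the latitude and longitude terms, so individual coefficients are best read as differences between otherwise identical homes, for example one extra room adding the `Rooms` estimate to the predicted price.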

4.1 Testing \(R^2\)

\[ R^2 = 1 - \dfrac{RSS}{TSS} = 0.441 \]
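
The formula can be applied directly to any set of observed and predicted prices. The sketch below is a minimal Python version with hypothetical numbers; `r_squared` is an illustrative name. The value 0.441 is lower than the 0.56 reported for the fitted model, which would be consistent with evaluation on data not used for fitting, although the source does not describe the split.

```python
def r_squared(actual, predicted):
    """Return 1 - RSS/TSS for paired observed and predicted values."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: squared prediction errors
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared deviations from the mean of the observed values
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - rss / tss


# Hypothetical observed and predicted values (not from the dataset)
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, fitted)
print(f"R^2 = {r2:.3f}")
```

Computed this way on new data, \(R^2\) can fall below the in-sample value and can even be negative when the predictions are worse than the mean of the observed values.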

5 Conclusion

5.1 Future Work

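  • Further explore the log transformation of price
  • Consider a GLM with a log link
  • Decide how to handle factors with hundreds of levels (e.g., SellerG with 268, Suburb with 314)
  • Address missing data (4548 observations were dropped from the proposed model)
  • Improve predictive accuracy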

6 Bibliography